Momentum Distillation


Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing Systems

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images.
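
As a rough illustration of the align-before-fuse idea, the sketch below applies an InfoNCE-style image-text contrastive loss to unimodal embeddings before any cross-modal fusion. It is a minimal sketch, assuming PyTorch; the projected [CLS] features, dimension, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of "align before fuse": an image-text contrastive (ITC)
# loss applied to unimodal embeddings *before* cross-modal fusion.
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (B, D) projected [CLS] embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss over image-to-text and text-to-image directions;
    # the matched pair sits on the diagonal of the logits matrix.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```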



Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Li, Yunxin, Liu, Zhenyu, Li, Zitao, Zhang, Xuanyu, Xu, Zhenran, Chen, Xinyu, Shi, Haoyuan, Jiang, Shenyuan, Wang, Xintong, Wang, Jifang, Huang, Shouzheng, Zhao, Xinping, Jiang, Borui, Hong, Lanqing, Wang, Longyue, Tian, Zhuotao, Huai, Baoxing, Luo, Wenhan, Luo, Weihua, Zhang, Zheng, Hu, Baotian, Zhang, Min

arXiv.org Artificial Intelligence

Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.


Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Zhou, Lifeng, Li, Yuke

arXiv.org Artificial Intelligence

In this paper, we propose a novel framework for speech-image retrieval. We use a speech-image contrastive (SIC) learning task to align speech and image representations at a coarse level and a speech-image matching (SIM) learning task to further refine the fine-grained cross-modal alignment; the two tasks are trained jointly in a unified manner. To optimize the learning process, we use an embedding queue that enables efficient sampling of high-quality and diverse negative representations during SIC learning. The queue also strengthens SIM learning by mining hard negatives from the contrastive similarities computed in the SIC task. To further stabilize learning under noisy supervision, we incorporate momentum distillation into training. Experimental results show that our framework outperforms the state-of-the-art method by more than 4% in R@1 on two benchmark datasets for speech-image retrieval. Moreover, zero-shot experiments show that our framework generalizes well.
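
The sketch below shows one way the two mechanisms described above could fit together: a FIFO embedding queue that supplies diverse negatives for SIC learning, and hard-negative mining for SIM based on in-batch contrastive similarities. The queue size, feature dimension, and argmax selection rule are illustrative assumptions, not the authors' implementation.

```python
# Sketch: a MoCo-style FIFO queue of past embeddings (extra negatives for
# contrastive learning) plus similarity-based hard-negative mining.
import torch
import torch.nn.functional as F

class EmbeddingQueue:
    def __init__(self, dim=256, size=65536):
        self.buffer = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, embeds):
        """embeds: (B, D) normalized embeddings to add as future negatives."""
        b = embeds.size(0)
        idx = (self.ptr + torch.arange(b, device=self.buffer.device)) % self.buffer.size(0)
        self.buffer[idx] = embeds
        self.ptr = int((self.ptr + b) % self.buffer.size(0))

def mine_hard_negatives(speech_feats, image_feats):
    """For each speech clip, pick the most similar *non-matching* image in
    the batch as a hard negative for the matching (SIM) head."""
    sims = speech_feats @ image_feats.t()   # (B, B) contrastive similarities
    sims.fill_diagonal_(float('-inf'))      # exclude the true positives
    return image_feats[sims.argmax(dim=1)]  # (B, D) hard negatives
```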


Converting and Smoothing False Negatives for Vision-Language Pre-training

Byun, Jaeseok, Kim, Dohoon, Moon, Taesup

arXiv.org Artificial Intelligence

We consider the critical issue of false negatives in Vision-Language Pre-training (VLP), a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. False negatives can prevent models from reaching optimal performance and can even lead to learning failures. To address this challenge, we propose COSMO (COnverting and SMOothing false negatives), a method that mitigates false negatives and is especially effective under hard negative sampling. Building on the recently developed GRouped mIni-baTch sampling (GRIT) strategy, our approach consists of two pivotal components: 1) an efficient connection-mining process that identifies false negatives and converts them into positives, and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of COSMO across multiple downstream tasks, highlighting the crucial role of addressing false negatives in VLP, which may even outweigh that of addressing false positives. We also demonstrate that COSMO is compatible with the recent BLIP family of models.
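
As a concrete illustration of the two components, the sketch below builds soft ITC targets by first promoting suspected false negatives (highly similar off-diagonal pairs) to positives and then applying label smoothing. The similarity threshold, smoothing value, and the thresholding rule itself are illustrative assumptions; COSMO's actual connection mining is more involved.

```python
# Sketch: "convert" likely false negatives into positives, then "smooth"
# the image-text contrastive (ITC) targets with label smoothing.
import torch
import torch.nn.functional as F

def soft_itc_targets(sim_matrix, smoothing=0.1, fn_threshold=0.9):
    """sim_matrix: (B, B) image-text similarities; diagonal = true pairs."""
    B = sim_matrix.size(0)
    targets = torch.eye(B, device=sim_matrix.device)
    # Convert: off-diagonal pairs that look too similar to be negatives
    # are relabeled as positives.
    targets[(sim_matrix > fn_threshold) & (targets == 0)] = 1.0
    targets = targets / targets.sum(dim=1, keepdim=True)  # renormalize rows
    # Smooth: mix in a uniform distribution over the batch.
    return (1 - smoothing) * targets + smoothing / B

def itc_loss(logits, soft_targets):
    """Cross-entropy against soft (non-one-hot) targets."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```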


Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Li, Junnan, Selvaraju, Ramprasaath R., Gotmare, Akhilesh Deepak, Joty, Shafiq, Xiong, Caiming, Hoi, Steven

arXiv.org Artificial Intelligence

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders-of-magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state of the art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.
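
A minimal sketch of momentum distillation as described above: an exponential-moving-average (EMA) copy of the model produces soft pseudo-targets, which are interpolated with the one-hot contrastive targets. The momentum coefficient and mixing weight are illustrative, and the full method applies this idea across several pre-training objectives rather than only the contrastive loss shown here.

```python
# Sketch of momentum distillation: learn from soft pseudo-targets produced
# by an EMA ("momentum") copy of the model. Coefficients are illustrative.
import copy
import torch
import torch.nn.functional as F

def make_momentum_copy(model):
    """Create the frozen momentum model once at the start of training."""
    momentum_model = copy.deepcopy(model)
    for p in momentum_model.parameters():
        p.requires_grad_(False)
    return momentum_model

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    """Call after each optimizer step so the momentum model tracks slowly."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distilled_itc_loss(logits, momentum_logits, alpha=0.4):
    """Mix one-hot targets with the momentum model's soft pseudo-targets."""
    B = logits.size(0)
    hard = F.one_hot(torch.arange(B, device=logits.device), B).float()
    soft = F.softmax(momentum_logits, dim=1)  # pseudo-targets
    targets = (1 - alpha) * hard + alpha * soft
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```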